Case Study 2: Spotify’s best of 2017

authors: Jakub, Luisa, Max and Stefan

date: 20/01/2019


Analysis of Dataset 1

What features can we predict through a combination of other features?

Predicting loudness with energy and speechiness

Multivariate Linear Model

## 
## Call:
## lm(formula = loudness ~ energy + speechiness, data = dataset_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4517 -0.7169  0.0492  0.7903  2.4026 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.5590     0.6163 -17.132  < 2e-16 ***
## energy        8.3505     0.8550   9.766 4.29e-16 ***
## speechiness  -5.8741     1.2514  -4.694 8.80e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.159 on 97 degrees of freedom
## Multiple R-squared:  0.5949, Adjusted R-squared:  0.5865 
## F-statistic: 71.22 on 2 and 97 DF,  p-value: < 2.2e-16

Model Assessment

  • p-values are very small
  • t-values are clearly distinct from 0
  • about 59% of the variance in loudness is explained by the model (adjusted R-squared: 0.5865)
  • 12.8% error rate (residual standard error of 1.159 divided by the loudness span of 9.066)
  • the F-statistic (71.22) is much larger than 1, so we can reject H0 (no relationship between loudness and the predictors energy and speechiness)
  • the variance of the residuals is roughly constant
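The error-rate figure above is simply the residual standard error relative to the span of the response; a minimal sketch using the numbers reported in the model output:

```r
# Error rate: residual standard error divided by the span (max - min)
# of the response variable.
error_rate <- function(rse, response) {
  rse / (max(response) - min(response))
}

# With the reported RSE of 1.159 and a loudness span of 9.066:
round(error_rate(1.159, c(0, 9.066)), 3)  # 0.128, i.e. 12.8%
```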

Visualization of the Model

Predicting valence with danceability and loudness

Multivariate Linear Model

## 
## Call:
## lm(formula = valence ~ danceability + loudness, data = dataset_1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51835 -0.14305  0.01909  0.12715  0.39991 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.29345    0.11924   2.461   0.0156 *  
## danceability  0.69742    0.14420   4.836 4.97e-06 ***
## loudness      0.04642    0.01001   4.638 1.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1792 on 97 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3144 
## F-statistic:  23.7 on 2 and 97 DF,  p-value: 4.168e-09

Assessment of the model

  • p-values are very small
  • t-values are clearly distinct from 0
  • about 31% of the variance in valence is explained by the model (adjusted R-squared: 0.3144)
  • roughly 22% error rate (residual standard error of 0.1792 divided by the valence span of 0.798)
  • the F-statistic (23.7) is much larger than 1, so we can reject H0 (no relationship between valence and the predictors danceability and loudness)
  • the variance of the residuals is roughly constant

Visualization of the Linear Model

Can we compress the data into more meaningful variables?

Principal component analysis

##                     PC1    PC2    PC3    PC4    PC5    PC6
## danceability     -0.181  0.477 -0.143  0.459 -0.045 -0.229
## energy           -0.490 -0.326 -0.040  0.018  0.014 -0.187
## key               0.027 -0.018 -0.499 -0.450 -0.090 -0.306
## loudness         -0.574 -0.165  0.077 -0.098  0.043 -0.025
## mode              0.121 -0.027  0.561  0.249  0.251 -0.004
## speechiness       0.319  0.084 -0.423  0.443  0.045 -0.109
## acousticness      0.072  0.435  0.242 -0.354 -0.083  0.292
## instrumentalness  0.015 -0.209  0.265  0.250 -0.642 -0.190
## liveness         -0.084 -0.207 -0.153  0.186  0.597  0.274
## valence          -0.441  0.299 -0.026  0.218 -0.023 -0.042
## tempo             0.228 -0.510 -0.028  0.147 -0.073 -0.023
## duration_min      0.144  0.079  0.275 -0.166  0.375 -0.780
##                   1.524  1.329  1.119  1.070  1.019  0.982

Based on the Kaiser criterion (eigenvalues > 1), we select 3 main factors: a factor must explain more variance than an individual standardized variable.

  1. Factor 1: energy, loudness and valence reinforce each other. We could interpret this as the positive energy of a song.
  2. Factor 2: danceability, lower tempo and acousticness. We could interpret this as rhythmicity.
  3. Factor 3: mode, lower key and speechiness. We could interpret this as tonality.
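The eigenvalue selection rule can be sketched with prcomp on toy data (the real analysis uses dataset_1; the toy columns here are invented):

```r
# Kaiser criterion on standardized data: keep components whose eigenvalue
# (variance of the component) exceeds 1, i.e. a component must explain
# more than a single standardized variable does.
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy$d <- toy$a + rnorm(50, sd = 0.3)  # column correlated with a

pca <- prcomp(toy, scale. = TRUE)
eigenvalues <- pca$sdev^2   # eigenvalues of the correlation matrix
keep <- which(eigenvalues > 1)
```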

Assessment of the analysis

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = dataset_1[, c(4:15)])
## Overall MSA =  0.54
## MSA for each item = 
##     danceability           energy              key         loudness 
##             0.48             0.54             0.40             0.55 
##             mode      speechiness     acousticness instrumentalness 
##             0.53             0.46             0.55             0.24 
##         liveness          valence            tempo     duration_min 
##             0.51             0.66             0.64             0.46
## [1] "PCA Condition 1: the KMO coefficient 0.536 must be above 0.5 in order to justify the appropriateness of PCA"
## [1] "PCA Condition 2: the p-value from the Bartlett test 0 must be below 0.05 to reject the null that the population correlation matrix is an identity matrix and continue with PCA"

While this analysis provides some insights and supports our summary statistics, we will not elaborate on PCA further because the KMO coefficient (.54) is not very strong, indicating that PCA might not be the best approach in this case.

What song should Spotify suggest to a new, previously unknown listener, based on our popularity coefficient?

Artists: 16 artists appear in the TOP 100 more than once

We identified a range of external factors that must be taken into account when looking at the number of appearances:

  • the number of artist followers, which allows the artist to have a wider reach
  • release frequency and date (new song and album releases)
  • branding effect (features, artist name, promoted position i.e. Spotify-curated artists, context-based promotion)
  • the Spotify algorithm: Spotify's revenue depends on its number of users, so it needs to promote popular music to maximise revenue

In 2017, leading artists (Ed Sheeran, Luis Fonsi, The Chainsmokers) released highly successful albums or singles.

Spotify itself is a streaming service whose users consume music: they want new songs, and it is precisely new songs that make it into the year's top 100. Consumers want to consume; producers must produce in order to claim success.

artist            no_of_appearance  genre
Ed Sheeran        4                 pop
The Chainsmokers  4                 pop, electropop
Drake             3                 hip hop, R&B
Martin Garrix     3                 electronic, house, progressive house
Bruno Mars        2                 R&B, funk, pop, soul, reggae, hip hop, rock
Calvin Harris     2                 electronic
Clean Bandit      2                 electronic
DJ Khaled         2                 hip hop
Imagine Dragons   2                 pop rock
Kendrick Lamar    2                 hip hop
Khalid            2                 R&B
Luis Fonsi        2                 latin, pop
Maroon 5          2                 pop, pop rock
Post Malone       2                 hip hop
The Weeknd        2                 R&B, pop, hip hop
ZAYN              2                 R&B, pop
## [1] "16 artists share 38 % of TOP 100 songs, making it 2.375 songs per super popular artist"
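A quick sanity check of the quoted figures, using the appearance counts from the table above:

```r
# 16 recurring artists; their appearance counts from the table above
appearances <- c(4, 4, 3, 3, rep(2, 12))
n_songs <- sum(appearances)
n_songs                        # 38 of the TOP 100 songs
n_songs / length(appearances)  # 2.375 songs per recurring artist
```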

Note: genres were taken from artists' profiles on Wikipedia, as it was impossible to integrate the TOP100 dataset with the Spotify API.

Top genres of top 16 artists

The music genres of the most popular artists on Spotify are pop, hip hop and R&B. Could these be factors for writing a popular song?

## # A tibble: 5 x 2
##        genre     n
##        <chr> <int>
## 1        pop     7
## 2    hip hop     6
## 3        r&b     5
## 4 electronic     3
## 5   pop rock     2

Distribution of different metrics of popularity

Weighting the distribution of the features by the total number of streams

Are there patterns in the features?

Computation of a custom popularity score

# popularity score based on the number of streams and the actual number of
# days the song could have stayed in the top 200
# (AllStreams is defined elsewhere in the script)
master_stats <- master_dataset_gl %>%
  group_by(Id = id, Artist = artists, Song = name) %>%
  summarise(MinDate = min(Date),
            TotalStreams = sum(Streams),
            NumberOfDays = sum(count),
            StreamsPerDay = TotalStreams/NumberOfDays,
            Popularity = (TotalStreams/AllStreams) *
              (NumberOfDays/as.numeric(as.Date('2018-01-01') - min(Date)))) %>%
  arrange(desc(TotalStreams))

Refactoring summary statistics with popularity coefficient
feature           value
danceability      0.7027656
energy            0.6637505
key               5.0563206
loudness          -5.4997869
mode              0.5532670
speechiness       0.0950520
acousticness      0.1808356
instrumentalness  0.0049967
liveness          0.1454561
valence           0.5470566
tempo             116.6150685
duration_min      3.6988909

Answers:

  1. Yes. On average, songs that make it into the Top 100 have stayed in the Top 200 for a long time (235 days on average), and 50% of the songs entered the TOP 200 in the first two months of the year.
  2. All Night by The Vamps made it into the TOP 100 with 14.61 million streams accumulated over just 25 days!
  3. Shape of You and Despacito share a staggering 9% of the TOP 100 streams! 12 songs share 23.5% of the TOP 100 market by streams.
  4. The most dominant artists of 2017 were Ed Sheeran, The Chainsmokers and Luis Fonsi. The most streamed artist, Ed Sheeran, managed to secure 56% more streams than The Chainsmokers, who came second! Just 5 artists share about 24% of the TOP 100 stream market.
  5. Rockstar by Post Malone was the most successful song measured on a streams-per-day basis: it averaged 5 million daily streams over 92 days.
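The streams-per-day figures quoted here follow directly from the StreamsPerDay definition in the popularity code above (total streams divided by days in the chart); for example:

```r
# StreamsPerDay as defined in the popularity code above
streams_per_day <- function(total_streams, n_days) total_streams / n_days

# All Night by The Vamps: 14.61 million streams over 25 days
streams_per_day(14.61e6, 25)  # 584400 streams per day
```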
rm(list=setdiff(ls(), "path"))

Dataset 2

Meta:

Load the necessary packages that haven't been loaded before, and the respective dataset first

Before jumping into the analysis and answering questions around dataset II, we will try to understand the structure and content of this dataset.

We can see that we are dealing with a pretty long dataset consisting of more than 3.4 million rows and merely 7 columns. For each day of the year and each region, the dataset records the top 200 tracks, their artists and the respective numbers of streams.

summary(dataset_2)
##     Position       Track Name           Artist             Streams        
##  Min.   :  1.00   Length:3441197     Length:3441197     Min.   :    1001  
##  1st Qu.: 45.00   Class :character   Class :character   1st Qu.:    3322  
##  Median : 92.00   Mode  :character   Mode  :character   Median :    9227  
##  Mean   : 94.64                                         Mean   :   51892  
##  3rd Qu.:143.00                                         3rd Qu.:   29658  
##  Max.   :200.00                                         Max.   :11381520  
##      URL                Date              Region         
##  Length:3441197     Length:3441197     Length:3441197    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
head(dataset_2)
##    Position                 Track Name        Artist Streams
## 1:        1 Reggaetón Lento (Bailemos)          CNCO   19272
## 2:        2                   Chantaje       Shakira   19270
## 3:        3  Otra Vez (feat. J Balvin) Zion & Lennox   15761
## 4:        4               Vente Pa' Ca  Ricky Martin   14954
## 5:        5                     Safari      J Balvin   14269
## 6:        6               La Bicicleta  Carlos Vives   12843
##                                                      URL       Date Region
## 1: https://open.spotify.com/track/3AEZUABDXNtecAOSC1qTfo 2017-01-01     ec
## 2: https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg 2017-01-01     ec
## 3: https://open.spotify.com/track/3QwBODjSEzelZyVjxPOHdq 2017-01-01     ec
## 4: https://open.spotify.com/track/7DM4BPaS7uofFul3ywMe46 2017-01-01     ec
## 5: https://open.spotify.com/track/6rQSrBHf7HlZjtcMZ4S4bO 2017-01-01     ec
## 6: https://open.spotify.com/track/0sXvAOmXgjR2QUqLK1MltU 2017-01-01     ec

Introduction and Basic Cleaning

Considering the data provided, we are interested in behavior over time and in relations between regions and continents. Taking a look at the structure, we see that the dataset is pretty clean, meaningfully named and doesn't contain any NAs. To make life a little easier we do a little bit of cleaning:

colnames(dataset_2)<-gsub(" ", "", colnames(dataset_2), fixed = TRUE) # remove blank space from column names
dataset_2$Date<-as.Date(dataset_2$Date) # parse as date
daily_spotify<-copy(dataset_2) # copy into a new data.table, we need it later

Predicting Tomorrow's Ranking

First, it could be interesting to check whether it is possible to gauge the movement of a track in the top 200 and predict its ranking for tomorrow or even later. To address this question from a top-level perspective, we simply plot the movements of artists' tracks over time. Since the US is Spotify's biggest market, we subset our data to this region to reduce dimensionality and complexity.

Now we can plot the movement of tracks for an artist and compare a few of those among the top 200.

plot_RankingPattern("Ed Sheeran")

plot_RankingPattern("Drake")

plot_RankingPattern("Post Malone")

plot_RankingPattern("Calvin Harris")

The plots show certain patterns, but the movement differs quite a bit from song to song and also seems to be influenced by the artist and other factors. For example, back in March 2017, when Ed Sheeran dropped his album, many tracks suddenly entered the top 200; some left after only a few days, while Shape of You survived until the end of the year and even got boosted again (maybe due to top-2017 playlists and year-in-music reviews). Some artists seem able to virally place many of their releases at the top positions (Ed Sheeran, Drake), while others climb the charts steadily.

A trivial approach to predicting tomorrow's ranking based on the graphs would be to assume that the song will rank in the same place or slightly below. Projecting a song's ranking position further ahead seems too complicated, so we will not pursue this topic further.
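The trivial baseline described above ("tomorrow's rank equals today's") can be evaluated on a toy ranking series; the numbers here are invented for illustration:

```r
# Hypothetical daily ranks of one track
ranks <- c(5, 4, 4, 6, 8, 7)

# Naive persistence forecast: predict tomorrow's rank as today's rank
pred   <- head(ranks, -1)
actual <- tail(ranks, -1)

# Mean absolute error of the naive baseline
mean(abs(actual - pred))  # 1.2 positions per day
```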

Ranking Persistence

We want to calculate for how many days a certain song stays in the top X positions of the top 200. Looking at the plots above, we assume that a track leaves the top X slowly but continuously. We do not require that a song stays in the top X continuously, day by day; for the sake of simplicity we use a simple count of days.

The function will be run for various top rankings (Top 3, Top 5, …) and for all regions in the initial dataset. We enter all values into a new data.table and add the continent each country belongs to using the countrycode package. Finally, the figures are plotted for various TopN rankings. We can immediately see that countries on the American continent listen to the same tracks for much longer than countries in Europe.

# Function that counts appearance of a track in a selected subset of TopN for a selected country
# Result is a mean over all tracks in selected subset

stay_in_ranking<-function (topN, region){
  daily_N_region<-daily_spotify[Position<=topN & Region %in% region,.(Artist,Streams,TrackName,URL)]
  gtN<-daily_N_region %>% count(TrackName,sort=T)
  mean(gtN$n)
}
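To make the counting idea concrete, here is a base-R toy example on hypothetical chart rows (the real analysis uses daily_spotify as above):

```r
# Hypothetical mini-chart: one row per (day, track) with its position
toy <- data.frame(TrackName = c("A", "A", "B", "A", "B", "C"),
                  Position  = c(1, 2, 1, 3, 2, 5))
topN <- 2

# count how many days each track appears at Position <= topN ...
in_top <- toy[toy$Position <= topN, ]
counts <- table(in_top$TrackName)

# ... and average over those tracks, as stay_in_ranking does
mean(counts)  # 2
```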

# We now want to apply this function to Top3, Top5,... and do this for every region

#We create a vector of all desired topN rankings and a columnName vector to build a data.table around that
rankingN<-c(3,5,10,15,20,50)
rankingName<-c("Top3","Top5","Top10","Top15","Top20","Top50")
all_regions<-unique(daily_spotify$Region)
region_ranking_persistance<-data.table(country=all_regions)

for (i in seq_along(rankingName)){
  region_ranking_persistance[,(rankingName[i]):=NA_real_]
}

# now we can call the stay_in_ranking function for each cell in the data table respectively
for (i in 1:nrow(region_ranking_persistance)){
  region<-region_ranking_persistance[i,country]
  for (j in 2:ncol(region_ranking_persistance)){
    region_ranking_persistance[i,j]<-stay_in_ranking(rankingN[j-1], region=region)
  }
}

#we add continent information to the result
region_ranking_persistance$continent<-countrycode(sourcevar = region_ranking_persistance[, country],
                                            origin = "iso2c",
                                            destination = "continent")  

# and plot the data for the Top 5, Top10, Top 50
ggplot(region_ranking_persistance, aes(x=reorder(country,-Top5), y=Top5)) +
  geom_bar(stat="identity", aes(fill=continent)) +
  labs(title="Ranking persistence of songs across countries", x="Country",
       y="Number of days a song stays in Top 5") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(region_ranking_persistance, aes(x=reorder(country,-Top10), y=Top10)) +
  geom_bar(stat="identity", aes(fill=continent)) +
  labs(title="Ranking persistence of songs across countries", x="Country",
       y="Number of days a song stays in Top 10") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(region_ranking_persistance, aes(x=reorder(country,-Top50), y=Top50)) +
  geom_bar(stat="identity", aes(fill=continent)) +
  labs(title="Ranking persistence of songs across countries", x="Country",
       y="Number of days a song stays in Top 50") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

We can derive that, globally, a successful song stays very high in the ranking for the majority of its run and then loses momentum. This matches the impressions from our first plots: a successful song seems to virally enter the top positions, stay there for quite some time, and then gradually descend.

Similarity between continents

The R package countrycode helps us retrieve the continent for each region, which is represented by an ISO2 code. We add the continent information to a new data.table.

# copy dataset and exclude all "global" regions
dataset_2_continents<-as.data.table(dataset_2[Region!="global"])

# add new column and add continent of respective ISO-code of country
dataset_2_continents$continent<-countrycode(sourcevar = dataset_2_continents[, Region],
                                      origin = "iso2c",
                                      destination = "continent")  

With this information we can now calculate all possible combinations of two continents, which aren't that many, since only four continents are included in the dataset. We do this by applying the combinations function of the gtools package to all unique continents in the dataset.

Subsequently, we calculate the similarity between two continents for both unique tracks and unique artists. We define similarity as the size of the intersection of the two continents' sets divided by the size of their union.
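This intersection-over-union measure is the Jaccard index; a minimal helper, assuming the inputs are already deduplicated (as the unique() calls in the loops below ensure):

```r
# Jaccard similarity: |A intersect B| / |A union B| for two sets
# given as vectors of unique elements
jaccard <- function(a, b) {
  shared <- length(intersect(a, b))
  shared / (length(a) + length(b) - shared)
}

jaccard(c("x", "y", "z"), c("y", "z", "w"))  # 2 shared of 4 total -> 0.5
```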

In the final result we can see that continents actually differ quite a lot. A similarity of around 20% is the maximum.

# gather all possible combinations of continents
continent_comparison<-as.data.table(combinations(length(unique(dataset_2_continents$continent))
                                                 ,2,unique(dataset_2_continents$continent)))

# measure track similarity
# similarity: defined as intersect set divided by union set

for (i in 1:nrow(continent_comparison)){
  continenta<-continent_comparison[i,V1]
  continentb<-continent_comparison[i,V2]
  subseta<-dataset_2_continents[continent %in% continenta]
  subsetb<-dataset_2_continents[continent %in% continentb]
  unique_a<-unique(subseta$TrackName)
  unique_b<-unique(subsetb$TrackName)
  continent_comparison[i,Track_similarity:=length(intersect(unique_a, unique_b))/
                  (length(unique_a)+length(unique_b)-length(intersect(unique_a, unique_b)))]
}

# measure Artist similarity
for (i in 1:nrow(continent_comparison)){
  continenta<-continent_comparison[i,V1]
  continentb<-continent_comparison[i,V2]
  subseta<-dataset_2_continents[continent %in% continenta]
  subsetb<-dataset_2_continents[continent %in% continentb]
  unique_a<-unique(subseta$Artist)
  unique_b<-unique(subsetb$Artist)
  continent_comparison[i,Artist_similarity:=length(intersect(unique_a, unique_b))/
                         (length(unique_a)+length(unique_b)-length(intersect(unique_a, unique_b)))]
}


print(continent_comparison)
##          V1      V2 Track_similarity Artist_similarity
## 1: Americas    Asia       0.12406810         0.1301722
## 2: Americas  Europe       0.13295790         0.1384055
## 3: Americas Oceania       0.19256757         0.2169648
## 4:     Asia  Europe       0.10115277         0.1076869
## 5:     Asia Oceania       0.14378217         0.1455954
## 6:   Europe Oceania       0.09522632         0.1123489

Influence of country-to-country distance on the time lag and similarity of top 200 rankings

In order to answer this question, we need three substeps. First, we need a list of all possible combinations of two countries in the Spotify dataset. Second, we want to calculate the distance between all these possible combinations. And finally, we want to calculate the similarity and time lag between the rankings of each pair of countries.

Our initial hypothesis is that we live in a globalized and connected world, which should also be reflected in quite similar music rankings between countries and a fast global spread of new top tracks.

While R packages can help with the first two steps, we have to come up with metrics for comparing two countries. We calculate similarity in the same way as above for continents. In addition, we measure the time lag as the difference in days between the dates on which Song X first appears in the ranking of country A vs. country B. Finally, we calculate the share of tracks that appear in the top 200 of both countries on the same day (day_zero).
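On toy data the three metrics look like this (country contents and dates are invented; the real computation works on dataset_2):

```r
# First-appearance dates of tracks in two hypothetical countries
a <- data.frame(TrackName = c("t1", "t2", "t3"),
                Date = as.Date(c("2017-01-01", "2017-01-05", "2017-02-01")))
b <- data.frame(TrackName = c("t1", "t2", "t4"),
                Date = as.Date(c("2017-01-01", "2017-01-10", "2017-03-01")))

# tracks appearing in both countries, with their appearance lag in days
shared <- merge(a, b, by = "TrackName")
shared$difference <- as.numeric(shared$Date.y - shared$Date.x)

length(intersect(a$TrackName, b$TrackName)) /
  length(union(a$TrackName, b$TrackName))   # similarity: 2/4 = 0.5
mean(shared$difference)                     # time lag: 2.5 days
mean(shared$difference == 0)                # day_zero share: 0.5
```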

We begin by calculating this for the top 10 streaming countries.

First step

Calculate all possible combinations of two countries out of the top 10 streaming countries and store the result in a data.table. We obtain 45 possible combinations, for which we can then calculate the metrics described above.

# first subset. Top 10 Streaming countries
## obtain top 10 streaming countries
top10_countries<-daily_spotify[Region!="global",.(Sum=sum(Streams)),by=Region][order(-Sum)][1:10]
top10_countries<-top10_countries$Region

# get all possible combinations of two countries from this set
top10_countries_combinations<-as.data.table(combinations(length(unique(top10_countries)),2,unique(top10_countries)))
# 45 combinations in total!
setnames(top10_countries_combinations,c("V1","V2"),c("C1","C2"))

Second step

We need to calculate the distances for all the combinations above. The R packages rgeos, geosphere and rworldmap will help us do so: they provide centroid coordinates for all countries in the world and allow us to calculate distances between two points. We store all distances in the table of country combinations that was calculated before.
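For reference, the great-circle formula that distHaversine computes can be sketched in base R (Earth radius taken as 6371 km; purely illustrative, the analysis itself uses geosphere):

```r
# Haversine great-circle distance (in km) between two lon/lat points
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# sanity check: one degree of latitude is roughly 111 km
haversine_km(0, 0, 0, 1)
```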

# get world map data
wmap <- getMap(resolution="high")
# get centroids
centroids <- gCentroid(wmap, byid=TRUE)
# get a data.table with centroids
centroids_df <- as.data.frame(centroids)
centroids_dt<-as.data.table(centroids_df,keep.rownames=T)

# use iso code instead of country names and put them all in lower case
centroids_dt2<-centroids_dt
centroids_dt2$rn<-countrycode(sourcevar = centroids_dt2$rn,
                                   origin = "country.name",
                                   destination = "iso2c") 
centroids_dt2$rn<-tolower(centroids_dt2$rn)
setnames(centroids_dt2,"rn","country")

# Calculation would work as follows
# distm(centroids_dt2[country=="de",.(x,y)], centroids_dt2[country=="us",.(x,y)], fun = distHaversine)[1,1]/1000

# only use rows with relevant countries
centroids_dt2 <- centroids_dt2[centroids_dt2$country %in% daily_spotify$Region, ]

Finally, we can write the distance of each pair of countries into our initial table of all combinations. A for-loop retrieves the values in the two country columns and hands them over as arguments to the distm function. Using basic data.table syntax, we add the respective distance in a new column.

# write distance into respective row of each pair 
for (i in 1:nrow(top10_countries_combinations)) {
  top10_countries_combinations[i,distance:=(distm(centroids_dt2[country==top10_countries_combinations[i,C1],.(x,y)],
                                                  centroids_dt2[country==top10_countries_combinations[i,C2],.(x,y)],
                                                  fun = distHaversine)[1,1]/1000)]
}

head(top10_countries_combinations,5)
##    C1 C2 distance
## 1: au br 15889.69
## 2: au de 14574.49
## 3: au es 15855.09
## 4: au gb 15333.46
## 5: au mx 14320.34

Third step

Now we are able to calculate all metrics and incorporate them into a copied table.

data_dataset2_top10<-top10_countries_combinations

for (i in 1:nrow(data_dataset2_top10)){
  countrya<-data_dataset2_top10[i,C1]
  countryb<-data_dataset2_top10[i,C2]
  subseta<-dataset_2[Region %in% countrya]
  subsetb<-dataset_2[Region %in% countryb]
  rowIndexa<-as.numeric(match(unique(subseta$TrackName), subseta$TrackName))
  rowIndexb<-as.numeric(match(unique(subsetb$TrackName), subsetb$TrackName))
  dataset_a_date<-subseta[rowIndexa,.(TrackName,Date)]
  dataset_b_date<-subsetb[rowIndexb,.(TrackName,Date)]
  tracks_a_b<-merge(dataset_a_date,dataset_b_date,by='TrackName',all=F)
  tracks_a_b[,difference:=(Date.y-Date.x)]
  data_dataset2_top10[i,similarity:=length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName))/
                  (length(dataset_a_date$TrackName)+length(dataset_b_date$TrackName)-length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName)))]
  data_dataset2_top10[i,day_zero:=length(which(tracks_a_b$difference==0))/length(tracks_a_b$difference)]
  data_dataset2_top10[i,time_difference:=mean(tracks_a_b$difference)]
}

We plot our three metrics against distance after cleaning the data, and perform regressions on them. On the 45 data points from the top 10 countries, the regression parameter suggests that similarity rises with distance, but the parameter is not significant at all. The same applies to the time difference of ranking appearance. We could keep our initial hypothesis, but we suspect that our dataset is a little small, since it only contains 10 countries. We therefore repeat the process for all regions in the initial data.

# prepare dataset for further analyses
data_dataset2_top10_clean<-unite(data_dataset2_top10, "country_combination",c(C1,C2), sep = "_", remove = T)
data_dataset2_top10_clean$time_difference<-as.numeric(data_dataset2_top10_clean$time_difference)


# plot scatter and do regressions
ggplot(data_dataset2_top10_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='lm')+theme_bw()

ggplot(data_dataset2_top10_clean,aes(distance,day_zero))+geom_point()+geom_smooth(method='lm')+theme_bw()

ggplot(data=data_dataset2_top10_clean,aes(distance,time_difference))+geom_point()+geom_smooth(method='lm')+theme_bw()

m1=lm(data = data_dataset2_top10_clean,similarity~distance)
summary(m1)
## 
## Call:
## lm(formula = similarity ~ distance, data = data_dataset2_top10_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07935 -0.02867 -0.01185  0.02648  0.14461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.914e-01  1.650e-02  11.601  7.9e-15 ***
## distance    2.890e-06  1.684e-06   1.716   0.0934 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05571 on 43 degrees of freedom
## Multiple R-squared:  0.06407,    Adjusted R-squared:  0.0423 
## F-statistic: 2.943 on 1 and 43 DF,  p-value: 0.09343
m2=lm(data = data_dataset2_top10_clean,day_zero~distance)
summary(m2)
## 
## Call:
## lm(formula = day_zero ~ distance, data = data_dataset2_top10_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.080677 -0.031000 -0.003472  0.032734  0.079710 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.036e-01  1.229e-02   49.11  < 2e-16 ***
## distance    -4.730e-06  1.255e-06   -3.77 0.000494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0415 on 43 degrees of freedom
## Multiple R-squared:  0.2484, Adjusted R-squared:  0.2309 
## F-statistic: 14.21 on 1 and 43 DF,  p-value: 0.0004937
m3=lm(data = data_dataset2_top10_clean,time_difference~distance)
summary(m3)
## 
## Call:
## lm(formula = time_difference ~ distance, data = data_dataset2_top10_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6405  -4.8177   0.1416   5.0080  19.1421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.1456617  2.1029030  -1.496    0.142
## distance     0.0002027  0.0002147   0.944    0.350
## 
## Residual standard error: 7.101 on 43 degrees of freedom
## Multiple R-squared:  0.02032,    Adjusted R-squared:  -0.002466 
## F-statistic: 0.8918 on 1 and 43 DF,  p-value: 0.3503

Distance comparison for entire dataset

Firstly, we calculate all possible combinations and write the distances between them into the resulting table. This time we obtain 1378 possible two-country combinations.

# excluding the region global from the data
dataset_2_nonglobal<-dataset_2[Region!="global"]
all_country_combinations<-as.data.table(combinations(length(unique(dataset_2_nonglobal$Region)),2,unique(dataset_2_nonglobal$Region)))
setnames(all_country_combinations,c("V1","V2"),c("C1","C2"))

# write distance into respective row of each pair 
for (i in 1:nrow(all_country_combinations)) {
  all_country_combinations[i,distance:=(distm(centroids_dt2[country==all_country_combinations[i,C1],.(x,y)],
                                              centroids_dt2[country==all_country_combinations[i,C2],.(x,y)],
                                              fun = distHaversine)[1,1]/1000)]
}

Next, for every combination our metrics are calculated and written into the data.table.

data_dataset2<-all_country_combinations

# function to do the calculations above for all possible combinations
# attention, run time: took some minutes on a quad-core i7!

for (i in 1:nrow(data_dataset2)){
  countrya<-data_dataset2[i,C1] #get country code in column 1 as value
  countryb<-data_dataset2[i,C2] #get country code in column 2 as value
  subseta<-dataset_2[Region %in% countrya] # subset initial dataset 2 by countrya
  subsetb<-dataset_2[Region %in% countryb] # subset initial dataset 2 by countryb
  
  # get row number of first appearance of a track for both datasets
  rowIndexa<-as.numeric(match(unique(subseta$TrackName), subseta$TrackName))
  rowIndexb<-as.numeric(match(unique(subsetb$TrackName), subsetb$TrackName))
  
  #create a new subset which contains Trackname and date of 1st appearance for all tracks in country
  dataset_a_date<-subseta[rowIndexa,.(TrackName,Date)]
  dataset_b_date<-subsetb[rowIndexb,.(TrackName,Date)]
  
  # merge tracks which exist in both countries into one table
  tracks_a_b<-merge(dataset_a_date,dataset_b_date,by='TrackName',all=F)
  #calculate time difference of first appearance of a track between the two countries in table 
  tracks_a_b[,difference:=(Date.y-Date.x)]
  
  #calculate similarity
  data_dataset2[i,similarity:=length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName))/
                  (length(dataset_a_date$TrackName)+length(dataset_b_date$TrackName)-length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName)))]
  # calculate day_zero share and time_difference as mean of all time_differences
  data_dataset2[i,day_zero:=length(which(tracks_a_b$difference==0))/length(tracks_a_b$difference)]
  data_dataset2[i,time_difference:=mean(tracks_a_b$difference)]
}

Again, we can plot the results and perform regressions on the scatter plots. This time we obtain a negative relationship between distance and similarity, which is also highly significant! The same applies to our other two metrics.

# prepare dataset for further analyses
data_dataset2_clean<-unite(data_dataset2, "country_combination",c(C1,C2), sep = "_", remove = T)
data_dataset2_clean$time_difference<-as.numeric(data_dataset2_clean$time_difference)

# plot scatter and do regressions
# relationship between distance and similarity
ggplot(data_dataset2_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='lm')+theme_bw()

m=lm(data = data_dataset2_clean,similarity~distance)
summary(m)
## 
## Call:
## lm(formula = similarity ~ distance, data = data_dataset2_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22042 -0.07254 -0.02415  0.05047  0.39322 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.985e-01  4.987e-03   59.86   <2e-16 ***
## distance    -8.500e-06  5.264e-07  -16.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0995 on 1376 degrees of freedom
## Multiple R-squared:  0.1593, Adjusted R-squared:  0.1587 
## F-statistic: 260.7 on 1 and 1376 DF,  p-value: < 2.2e-16
# relationship between distance and time_difference
ggplot(data_dataset2_clean,aes(distance,time_difference))+geom_point()+geom_smooth(method='lm')+theme_bw()

m2=lm(data = data_dataset2_clean,time_difference~distance)
summary(m2)
## 
## Call:
## lm(formula = time_difference ~ distance, data = data_dataset2_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.5542  -6.0322  -0.4456   5.9101  23.4655 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.859e+00  4.226e-01   4.400 1.17e-05 ***
## distance    -2.299e-04  4.461e-05  -5.153 2.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.432 on 1376 degrees of freedom
## Multiple R-squared:  0.01893,    Adjusted R-squared:  0.01822 
## F-statistic: 26.55 on 1 and 1376 DF,  p-value: 2.935e-07
# relationship between distance and share of same-day top200 entries
ggplot(data_dataset2_clean,aes(distance,day_zero))+geom_point()+geom_smooth(method='lm')+theme_bw()

m3=lm(data = data_dataset2_clean,day_zero~distance)
summary(m3)
## 
## Call:
## lm(formula = day_zero ~ distance, data = data_dataset2_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40810 -0.04904  0.00383  0.06233  0.24190 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.450e-01  4.495e-03 121.249   <2e-16 ***
## distance    -4.623e-06  4.745e-07  -9.743   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08969 on 1376 degrees of freedom
## Multiple R-squared:  0.06453,    Adjusted R-squared:  0.06385 
## F-statistic: 94.92 on 1 and 1376 DF,  p-value: < 2.2e-16

However, the best-fitting relationship would probably not be a linear one. It also seems that similarity is only markedly higher for countries that are close to each other. We could partition the data into two clusters: near pairs and mid-to-high-distance country pairs. For countries close to each other, some combinations (potentially neighboring countries) share more tracks in their Top 200 than countries that are further apart.
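One way to capture such a decay pattern is a logarithmic distance term. A minimal sketch on simulated stand-in data (the data frame `sim` and its coefficients are ours, standing in for data_dataset2_clean):

```r
# hedged sketch: similarity often decays roughly with log(distance);
# `sim` is simulated stand-in data, not the real chart data
set.seed(42)
sim <- data.frame(distance = runif(500, 100, 15000))
sim$similarity <- 0.6 - 0.06 * log(sim$distance) + rnorm(500, sd = 0.03)

m_lin <- lm(similarity ~ distance, data = sim)       # straight line
m_log <- lm(similarity ~ log(distance), data = sim)  # logarithmic decay

# on decay-shaped data the log specification explains more variance
c(linear = summary(m_lin)$r.squared, log = summary(m_log)$r.squared)
```

On the real data, one would compare the adjusted R-squared of `lm(similarity ~ log(distance), data = data_dataset2_clean)` against the linear model fitted above.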

Clustering

ggplot(data_dataset2_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='loess')+theme_bw()

set.seed(123)  # kmeans starts from random centers; fix the seed for reproducibility
k<-2
# note: distance (km, in the thousands) dwarfs similarity (0-1),
# so this unscaled clustering splits the data almost purely along distance
clust_km<-kmeans(data_dataset2_clean[,c("distance","similarity")],k)
cluster_dataset2_clean <- cbind(data_dataset2_clean, cluster=factor(clust_km$cluster))
ggplot(cluster_dataset2_clean,aes(distance,similarity))+geom_point(aes(color=cluster))+theme(legend.position="none")
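Because k-means is scale-sensitive, standardising both features first would let similarity contribute to the clustering as well, and an elbow plot of the total within-cluster sum of squares is a common check on the choice of k = 2. A hedged sketch, again on simulated stand-in data (the data frame `d` is ours):

```r
set.seed(1)
# simulated stand-in for data_dataset2_clean: two loose groups of country pairs
d <- rbind(
  data.frame(distance = rnorm(100, 1500, 400),  similarity = rnorm(100, 0.35, 0.05)),
  data.frame(distance = rnorm(100, 8000, 1500), similarity = rnorm(100, 0.15, 0.05))
)

# standardise both features so each contributes equally to the distance metric
d_scaled <- scale(d)

# elbow plot: total within-cluster SS for k = 1..6
wss <- sapply(1:6, function(k) kmeans(d_scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")

clust <- kmeans(d_scaled, centers = 2, nstart = 10)
table(clust$cluster)
```

A pronounced kink at k = 2 in the elbow plot would support the two-cluster interpretation of near versus distant country pairs.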

Conclusion

After the analysis of the second dataset described above, we were able to derive some key findings:

  1. Tracks move through the Top 200 very differently. Some enter at top positions, dominate the rankings for many weeks and then slowly descend. Others enter at a low position and gradually move up. There are also songs that re-enter the Top 200 a few weeks after leaving it. Many unconsidered factors, such as release dates of albums, singles and videos, artist promotions, tour dates and curated playlists, potentially influence this movement.

  2. The number of days a song stays in a top-ranking position heavily depends on the country you are looking at. Countries in the Americas, for example, tend to keep songs in the top ranks longer than countries in Europe. Perhaps European countries are more influenced by American artists than the other way round.

  3. The similarity of rankings between entire continents is quite low.

  4. For the entire dataset, distance does affect the similarity of the Top 200 between two countries to some extent. However, this effect seems pronounced only up to a certain distance (~6,000 km); beyond that, distance hardly affects similarity at all. We assume that countries within one continent share more songs in their rankings, which appear in the top-left corner of the distance-vs-similarity chart.